Introduction

This IPython notebook illustrates how to sample and label a table (the candidate set). First, we need to import the py_entitymatching package and other libraries as follows:



In [1]:

    
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd



In [2]:

    
# Get the datasets directory
datasets_dir = os.path.join(em.get_install_path(), 'datasets')

path_A = os.path.join(datasets_dir, 'DBLP.csv')
path_B = os.path.join(datasets_dir, 'ACM.csv')
path_C = os.path.join(datasets_dir, 'tableC.csv')



In [3]:

    
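# Read the two input tables, using 'id' as the key of each
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Read the candidate set and link it to A and B through its foreign keys
C = em.read_csv_metadata(path_C, key='_id',
                         fk_ltable='ltable_id', fk_rtable='rtable_id',
                         ltable=A, rtable=B)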









    



Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.



In [4]:

    
C.head()









    Out[4]:






  
    
      
	_id	ltable_id	rtable_id	ltable_authors	ltable_title	rtable_authors	rtable_title
0	0	conf/sigmod/AbadiC02	191915	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffre...	Shoring up persistent applications
1	1	conf/sigmod/AbadiC02	191931	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Daniel J. Dietterich	DEC data distributor: for data replication and data warehousing
2	2	conf/sigmod/AbadiC02	233356	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Mitch Cherniack, Stanley B. Zdonik	Rule languages and internal algebras for rule-based optimizers
3	3	conf/sigmod/AbadiC02	276311	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Mitch Cherniack, Stan Zdonik	Changing the rules: transformations for rule-based optimizers
4	4	conf/sigmod/AbadiC02	335432	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang	NiagaraCQ: a scalable continuous query system for Internet databases



In [5]:

    
len(C)









    Out[5]:





14673

Sample Candidate Set

A sample of the candidate set can be drawn for labeling as follows:



In [6]:

    
# Sample 450 tuple pairs from the candidate set
S = em.sample_table(C, 450)
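
To make the sampling step concrete, here is a minimal sketch of drawing a fixed-size sample from a set of candidate pairs. It assumes the sample is drawn uniformly at random without replacement, and it uses only the Python standard library. It is not the py_entitymatching implementation: the function `sample_rows`, the `seed` parameter, and the stand-in rows are hypothetical and exist only for illustration.

```python
import random


def sample_rows(rows, sample_size, seed=None):
    """Return sample_size rows drawn uniformly at random, without replacement.

    Hypothetical helper for illustration; not part of py_entitymatching.
    """
    if sample_size > len(rows):
        raise ValueError("sample_size exceeds the number of rows")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    return rng.sample(rows, sample_size)


# Stand-in for a candidate set: one dict per candidate pair (made-up values).
candidate_pairs = [
    {"_id": i, "ltable_id": "l%d" % i, "rtable_id": "r%d" % i}
    for i in range(1000)
]

sampled = sample_rows(candidate_pairs, 450, seed=0)
print(len(sampled))
```

Sampling without replacement guarantees that no candidate pair is presented to the labeler twice within the same sample.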

Label the Sampled Set



In [7]:

    
# Label the sampled set
# Specify the name for the label column
G = em.label_table(S, 'gold_label')









    



Column name (gold_label) is not present in dataframe






    



---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
/Users/pradap/miniconda3/lib/python3.5/site-packages/py_entitymatching-0.1.0-py3.5.egg/py_entitymatching/labeler/labeler.py in label_table(table, label_column_name, verbose)
     64     try:
---> 65         from PyQt4 import QtGui
     66     except ImportError:

ImportError: No module named 'PyQt4'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
<ipython-input-7-a863790c9d34> in <module>()
      1 # Label the sampled set
      2 # Specify the name for the label column
----> 3 G = em.label_table(S, 'gold_label')

/Users/pradap/miniconda3/lib/python3.5/site-packages/py_entitymatching-0.1.0-py3.5.egg/py_entitymatching/labeler/labeler.py in label_table(table, label_column_name, verbose)
     65         from PyQt4 import QtGui
     66     except ImportError:
---> 67         raise ImportError('PyQt4 is not installed. Please install PyQt4 to use '
     68                       'GUI related functions in py_entitymatching.')
     69 

ImportError: PyQt4 is not installed. Please install PyQt4 to use GUI related functions in py_entitymatching.
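
The traceback above shows that the labeling GUI could not start because the PyQt4 module was missing. As a hedged sketch, the presence of such a dependency can be checked up front with the standard library before calling a GUI function. The module name `PyQt4` is taken from the traceback; other versions of the package may depend on a different Qt binding. The helper name `module_available` is hypothetical.

```python
import importlib.util


def module_available(module_name):
    """Return True if a top-level module can be found, without importing it.

    Hypothetical helper for illustration; not part of py_entitymatching.
    """
    return importlib.util.find_spec(module_name) is not None


# 'PyQt4' is the module named in the traceback above.
if module_available("PyQt4"):
    print("GUI dependency found; the labeling window should be able to open.")
else:
    print("PyQt4 is missing; install it before calling GUI functions.")
```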

The user must label each pair with 0 for a non-match and 1 for a match. Typically, sampling and labeling are done iteratively, until the labeled set contains a sufficient density of matches. Once labeled, the data set will look like this:



In [ ]:

    
# Assume that we have labeled the data and stored it in 
# labeled_data_demo.csv

path_labeled_data = os.path.join(datasets_dir, 'labeled_data_demo.csv')
G = em.read_csv_metadata(path_labeled_data, key='_id', 
                         fk_ltable='ltable_id', fk_rtable='rtable_id',
                         ltable=A, rtable=B)



In [ ]:

    
G.head()

	_id	ltable_id	rtable_id	ltable_authors	ltable_title	rtable_authors	rtable_title
0	0	conf/sigmod/AbadiC02	191915	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffre...	Shoring up persistent applications
1	1	conf/sigmod/AbadiC02	191931	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Daniel J. Dietterich	DEC data distributor: for data replication and data warehousing
2	2	conf/sigmod/AbadiC02	233356	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Mitch Cherniack, Stanley B. Zdonik	Rule languages and internal algebras for rule-based optimizers
3	3	conf/sigmod/AbadiC02	276311	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Mitch Cherniack, Stan Zdonik	Changing the rules: transformations for rule-based optimizers
4	4	conf/sigmod/AbadiC02	335432	Daniel J. Abadi, Mitch Cherniack	Visual COKO: a debugger for query optimizer development	Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang	NiagaraCQ: a scalable continuous query system for Internet databases
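
The stopping criterion mentioned earlier, a sufficient density of matches, can be checked with a few lines of code. The sketch below is illustrative only: the labels are made up, the function name `match_density` is hypothetical, and the threshold `MIN_DENSITY` is an arbitrary assumption rather than a value recommended by py_entitymatching.

```python
def match_density(labels):
    """Return the fraction of labeled pairs marked as a match (1)."""
    if not labels:
        raise ValueError("no labels to summarize")
    invalid = [value for value in labels if value not in (0, 1)]
    if invalid:
        raise ValueError("labels must be 0 or 1, got: %r" % invalid)
    return sum(labels) / len(labels)


# Hypothetical labels for ten sampled pairs (made up for illustration).
example_labels = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

# Hypothetical threshold; choose a value suited to your own data.
MIN_DENSITY = 0.2

density = match_density(example_labels)
needs_another_round = density < MIN_DENSITY
print(density, needs_another_round)
```

If the density falls below the chosen threshold, another round of sampling and labeling would follow.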